ChatGPT or Grammarly? Evaluating ChatGPT on Grammatical Error Correction Benchmark

Haoran Wu† Wenxuan Wang† Yuxuan Wan † Wenxiang Jiao‡ Michael R. Lyu†

†Department of Computer Science and Engineering, The Chinese University of Hong Kong

1155157061@link.cuhk.edu.hk {wxwang,yxwan9,lyu}@cse.cuhk.edu.hk

‡Tencent AI Lab

Abstract

arXiv:2303.13648v1 [cs.CL] 15 Mar 2023

ChatGPT is a cutting-edge artificial intelligence language model developed by OpenAI, which has attracted a lot of attention due to its surprisingly strong ability in answering follow-up questions. In this report, we aim to evaluate ChatGPT on the Grammatical Error Correction (GEC) task, and compare it with a commercial GEC product (e.g., Grammarly) and state-of-the-art models (e.g., GECToR). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT does not perform as well as those baselines in terms of the automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences. We inspect the outputs and find that ChatGPT goes beyond one-by-one corrections. Specifically, it prefers to change the surface expression of certain phrases or sentence structure while maintaining grammatical correctness. Human evaluation quantitatively confirms this and suggests that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate that ChatGPT is severely underestimated by the automatic evaluation metrics and could be a promising tool for GEC.

1 Introduction

ChatGPT1, the current “super-star” of the artificial intelligence (AI) field, attracted millions of registered users within just a week of its launch by OpenAI. One of the reasons for ChatGPT being so popular is its surprisingly strong performance on various natural language processing (NLP) tasks (Bang et al., 2023), including question answering (Omar et al., 2023), text summarization (Yang et al., 2023), machine translation (Jiao et al., 2023), logic reasoning (Frieder et al., 2023), code debugging (Xia and Zhang, 2023), etc. There is also a trend of using ChatGPT as a writing assistant for text polishing.

1 https://chat.openai.com/chat

Despite the widespread use of ChatGPT, it remains unclear to the NLP community to what extent ChatGPT is capable of revising text and correcting grammatical errors. To fill this research gap, we empirically study the Grammatical Error Correction (GEC) ability of ChatGPT by evaluating it on the CoNLL2014 benchmark dataset (Ng et al., 2014), and comparing its performance to Grammarly, a prevalent cloud-based English typing assistant with 30 million daily users (Grammarly, 2023), and GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model. With this study, we aim to answer a research question:

Is ChatGPT a good tool for GEC?

To the best of our knowledge, this is the first study on ChatGPT’s ability in GEC.

We present the major insights gained from this evaluation as below:

ChatGPT performs worse than the baseline systems in terms of the automatic evaluation metrics (e.g., F0.5 score), particularly on long sentences.
ChatGPT goes beyond one-by-one corrections by introducing more changes to the surface expression of certain phrases or sentence structure while maintaining grammatical correctness.
Human evaluation quantitatively demonstrates that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections.

Our evaluation indicates the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggests that ChatGPT is a promising tool for GEC.

Type	Error	Correction
Preposition	I sat in the talk	I sat in on the talk
Morphology	dreamed	dreamt
Determiner	I like the ice cream	I like ice cream
Tense/Aspect	I like play basketball	I like playing basketball
Syntax	I have not the book	I do not have the book
Punctuation	We met they talked and left	We met, they talked and left

Table 1: Different types of error in GEC.

2 Background

2.1 ChatGPT

ChatGPT is an intelligent chatbot powered by large language models developed by OpenAI. It has attracted great attention from industry, academia, and the general public due to its strong ability in answering various follow-up questions, correcting inappropriate questions (Zhong et al., 2023), and even refusing illegal questions. While the technical details of ChatGPT have not been released systematically, it is known to be built upon InstructGPT (Ouyang et al., 2022), which is trained using instruction tuning (Wei et al., 2022a) and reinforcement learning from human feedback (RLHF, Christiano et al., 2017).

System	Precision	Recall	F0.5
GECToR	71.2	38.4	60.8
Grammarly	67.3	51.1	63.3
ChatGPT	51.2	62.8	53.1

Table 2: GEC performance of GECToR, Grammarly, and ChatGPT.

2.2 Grammatical Error Correction

Grammatical Error Correction (GEC) is the task of correcting different kinds of errors in text, such as spelling, punctuation, grammatical, and word choice errors (Ruder, 2022). It is in high demand, as writing plays an important role in academics, work, and daily life. Table 1 illustrates different grammatical errors, borrowed from the comprehensive survey on grammatical error correction by Bryant et al. (2022). In general, grammatical errors can be roughly classified into three categories: omission errors, such as the missing "on" in the first example; replacement errors, such as "dreamed" for "dreamt" in the second example; and insertion errors, such as the extra "the" in the third example.

To evaluate the performance of GEC, researchers have built various benchmark datasets, which include but are not limited to:

CoNLL-2014: Given short English texts written by non-native speakers, the task requires a participating system to correct all errors present in each text.

BEA-2019: It is similar to CoNLL-2014 but introduces a new dataset, namely, the Write&Improve+LOCNESS corpus, which represents a wider range of native and learner English levels and abilities (Bryant et al., 2019).

JFLEG: It represents a broad range of language proficiency levels and uses holistic fluency edits to not only correct grammatical errors but also make the original text more native sounding (Tetreault et al., 2017).
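The three error categories described in the Grammatical Error Correction section (omission, replacement, insertion) can be illustrated with a token-level diff. The following is a minimal sketch, not a method used in this report: it assumes whitespace tokenization and uses Python's standard `difflib` alignment, and the function name `classify_edits` is hypothetical. A token that only appears in the correction signals an omission error, a token that only appears in the source signals an insertion error, and a substituted token signals a replacement error.

```python
from difflib import SequenceMatcher

# difflib opcode (source -> correction) mapped to the error category it suggests.
CATEGORY = {
    "insert": "omission",      # the correction adds a token the writer omitted
    "replace": "replacement",  # the correction swaps one token for another
    "delete": "insertion",     # the correction removes a token the writer inserted
}


def classify_edits(source, corrected):
    """Return (category, source_tokens, corrected_tokens) for each edited span."""
    src, cor = source.split(), corrected.split()
    matcher = SequenceMatcher(a=src, b=cor, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        edits.append((CATEGORY[tag], " ".join(src[i1:i2]), " ".join(cor[j1:j2])))
    return edits


# The first three rows of Table 1.
for source, corrected in [
    ("I sat in the talk", "I sat in on the talk"),
    ("dreamed", "dreamt"),
    ("I like the ice cream", "I like ice cream"),
]:
    print(source, "->", classify_edits(source, corrected))
```

A plain diff like this only approximates edit boundaries; it is meant to make the categories concrete rather than to replace the annotated edits that come with the benchmark datasets.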

ChatGPT for GEC
1. Experimental Setup
  Dataset. We evaluate the ability of ChatGPT in grammatical error correction on the CoNLL2014 task (Ng et al., 2014) dataset. The dataset is composed of short paragraphs written by non-native speakers of English, accompanied by the corresponding annotations of the grammatical errors. We pulled 100 sentences sequentially from the official-combined test set in the alternate folder of the dataset.
  Evaluation Metric. To evaluate the performance of GEC, we adopt three metrics that are widely used in the literature, namely, Precision, Recall, and F0.5 score. Among them, the F0.5 score combines both Precision and Recall, where Precision is assigned a higher weight (Wikipedia contributors, 2023a).
  
  System	Short Precision	Short Recall	Short F0.5	Medium Precision	Medium Recall	Medium F0.5	Long Precision	Long Recall	Long F0.5
  GECToR	76.9	38.5	64.1	68.8	37.5	58.9	71.8	38.9	61.5
  Grammarly	62.5	60.6	62.1	68.9	56.0	65.9	67.3	45.3	61.4
  ChatGPT	58.5	66.7	60.0	48.7	60.7	50.7	51.0	62.8	53.0

  Table 3: GEC performance with respect to sentence length.
  
  Specifically, the three metrics are expressed as:

  $$\text{Precision} = \frac{TP}{TP + FP}, \tag{1}$$

  $$\text{Recall} = \frac{TP}{TP + FN}, \tag{2}$$

  $$F_{0.5} = \frac{1.25 \times \text{Precision} \times \text{Recall}}{0.25 \times \text{Precision} + \text{Recall}}, \tag{3}$$

  where TP, FP and FN represent the true positives, false positives and false negatives of the predictions, respectively. We use the official scoring program provided by CoNLL2014, adapted to be compatible with the latest Python environment.
  Baselines. In this report, we perform the GEC task on three systems, including:
  - ChatGPT: We query ChatGPT manually rather than through an API due to the instability of ChatGPT. For example, when a query sentence resembles a question or demand, ChatGPT may stop performing GEC and respond to the “demand” instead. After a few trials, we find a prompt that works well for ChatGPT:
    Do grammatical error correction on all the following sentences I type in the conversation.
    We query ChatGPT with this prompt for each test sample.
  - Grammarly: Grammarly is a prevalent cloud-based English typing assistant. It reviews spelling, grammar, punctuation, clarity, engagement, and delivery mistakes in English texts, detects plagiarism and suggests replacements for the identified errors (Wikipedia contributors, 2023b). As stated by Grammarly, every day, 30 million people and 50,000 teams around the world use Grammarly with their writing (Grammarly, 2023). When querying Grammarly, we open a text file and paste all the test samples into separate paragraphs. We enable all the grammar correction settings and only ask it to correct the issues flagged as correctness problems (red underline), while leaving the clarity (blue underline), engagement (green underline) and delivery (purple underline) suggestions unchanged. We iterate this process several times until there is no error detected by Grammarly.
  - GECToR: Besides Grammarly, we also compare ChatGPT with GECToR (Omelianchuk et al., 2020), a state-of-the-art GEC model in research, which also exhibits good performance on the CoNLL2014 task. We adopt the implementation based on the pre-trained RoBERTa model.
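To make the Precision, Recall and F0.5 definitions from the Evaluation Metric paragraph concrete, here is a minimal sketch that computes them from edit-level counts. It is not the official CoNLL2014 scoring program; the function name `gec_scores`, the zero-denominator convention, and the example counts are assumptions made for illustration.

```python
def gec_scores(tp, fp, fn, beta=0.5):
    """Return (precision, recall, F_beta) computed from edit-level counts."""
    # Assumption: an empty denominator yields 0.0; the official scorer may differ.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    beta_sq = beta ** 2  # 0.25 for F0.5, which weights precision more than recall
    f_beta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, f_beta


# Hypothetical counts, not taken from the paper.
precision, recall, f_half = gec_scores(tp=6, fp=2, fn=4)
print(f"P={precision:.3f} R={recall:.3f} F0.5={f_half:.3f}")
```

With these hypothetical counts, precision is 0.75, recall is 0.60, and F0.5 is about 0.714, which sits closer to precision than to recall, as intended by the choice of beta = 0.5.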
2. Results and Analysis
  Overall Performance. Table 2 presents the overall performance of the three systems. As seen, ChatGPT obtains the highest recall, GECToR obtains the highest precision, while Grammarly achieves a better balance between the two metrics, resulting in the highest F0.5 score. These results suggest that ChatGPT tends to correct as many errors as possible, which may lead to more over-corrections. In contrast, GECToR corrects only those errors it is confident about, which leaves many errors uncorrected. Grammarly combines the advantages of both and thus performs more stably.
  ChatGPT Performs Worse on Long Sentences? To understand which kind of sentences ChatGPT is good at, we divide the 100 test sentences into three equally sized categories, namely, Short, Medium and Long. Table 3 shows the results with respect to sentence length. As seen, the gap between ChatGPT and Grammarly narrows considerably on short sentences (60.0 vs. 62.1 F0.5). In contrast, ChatGPT performs much worse on longer sentences, at least in terms of the existing evaluation metrics.
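The split into Short, Medium and Long categories can be sketched as follows. The report does not state how sentence length is measured or where the boundaries fall, so this sketch assumes whitespace token counts and near-equal bucket sizes (34/33/33 for 100 sentences); the function name `split_by_length` is hypothetical.

```python
def split_by_length(sentences, n_buckets=3):
    """Sort sentences by token count and cut them into near-equal buckets."""
    # Assumption: length is the number of whitespace-separated tokens.
    ordered = sorted(sentences, key=lambda s: len(s.split()))
    base, extra = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for i in range(n_buckets):
        size = base + (1 if i < extra else 0)  # the first `extra` buckets get one more
        buckets.append(ordered[start:start + size])
        start += size
    return buckets


# Toy data: 100 synthetic "sentences" with 1 to 100 tokens each.
toy_sentences = [" ".join(["w"] * n) for n in range(1, 101)]
short, medium, long_ = split_by_length(toy_sentences)
print(len(short), len(medium), len(long_))
```

Each bucket can then be scored separately to obtain per-length Precision, Recall and F0.5 values like those reported in Table 3.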
  ChatGPT Goes Beyond One-by-One Corrections. We inspect the output of the three systems, especially those for long sentences, and find that ChatGPT is not limited to correcting the errors in a one-by-one fashion. Instead, it is more willing to change the surface expression of some phrases or the sentence structure. For example, in Table 4, GECToR and Grammarly make minor changes to the source sentence (i.e., “an example” to “example”, “family potential disease” to “a family ’s potential disease”), while ChatGPT modifies the sentence structure (i.e., “for family potential disease” to “in preventing potential family diseases”) and word choice (i.e., “chances” to “opportunities”). This indicates that the outputs of ChatGPT maintain grammatical correctness, although they do not follow the original expression of the source sentences.

  System	Sentence
  Source	For an example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise .
  Reference	For example , if exercising (OR exercise) is helpful for a potential family disease , we can always look for more chances for the family to do exercise .
  GECToR	For example , if exercising is helpful for family potential disease , we can always look for more chances for the family to go exercise .
  Grammarly	For example , if exercising is helpful for a family ’s potential disease , we can always look for more chances for the family to go exercise .
  ChatGPT	For example , if exercise is helpful in preventing potential family diseases , we can always look for more opportunities for the family to exercise .

  Table 4: Comparison of the outputs from different GEC systems.

  To validate our hypothesis, we let Grammarly further correct the grammatical errors in the outputs of GECToR and ChatGPT. Table 5 lists the results. We can observe that Grammarly introduces a negligible improvement to the output of ChatGPT, indicating that ChatGPT indeed generates grammatically correct sentences. In contrast, Grammarly noticeably improves the performance of GECToR (i.e., +2.1 F0.5, +16.5 Recall), suggesting that there are still many errors in the output of GECToR.

  System	Precision	Recall	F0.5
  GECToR	71.2	38.4	60.8
  + Grammarly	-5.9	+16.5	+2.1
  ChatGPT	51.2	62.8	53.1
  + Grammarly	+0.4	+0.8	+0.5

  Table 5: GEC performance with Grammarly for further correction.

  System	#Under	#Mis	#Over
  GECToR	13	4	0
  Grammarly	14	0	1
  ChatGPT	3	3	30

  Table 6: Number of under-correction (Under), mis-correction (Mis) and over-correction (Over) produced by different GEC systems.
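As a quick worked check of the "+ Grammarly" rows in Table 5, the reported Precision and Recall changes can be applied to the base scores and the F0.5 gain recomputed from the F0.5 formula. This is a sketch on the rounded values printed in the tables, so the results are approximate; the variable names are illustrative.

```python
def f_half(precision, recall):
    """F0.5 from precision and recall (both in percentage points)."""
    return 1.25 * precision * recall / (0.25 * precision + recall)


# Base (Precision, Recall) and the "+ Grammarly" deltas as printed in Table 5.
base = {"GECToR": (71.2, 38.4), "ChatGPT": (51.2, 62.8)}
delta = {"GECToR": (-5.9, 16.5), "ChatGPT": (0.4, 0.8)}

gains = {}
for system, (p, r) in base.items():
    dp, dr = delta[system]
    gains[system] = f_half(p + dp, r + dr) - f_half(p, r)
    print(f"{system}: F0.5 gain after Grammarly = {gains[system]:+.1f}")
```

The recomputed gains round to +2.1 for GECToR and +0.5 for ChatGPT, in line with the F0.5 column of Table 5: for GECToR, the large recall gain outweighs the precision loss even though F0.5 favors precision.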
  
  Human Evaluation. We conduct a human evaluation to further demonstrate the potential of ChatGPT for the GEC task. Specifically, we follow Wang et al. (2022) to manually annotate the issues in the outputs of the three systems, including 1) Under-correction, i.e., grammatical errors that are not found; 2) Mis-correction, i.e., grammatical errors that are found but modified incorrectly, such that the result is either grammatically or semantically incorrect; 3) Over-correction, i.e., other modifications beyond the changes in the reference. We sample 20 sentences out of the 100 test sentences and ask two annotators to identify the issues. Table 6 shows the results. ChatGPT has the fewest under-corrections among the three systems (3, versus 13 for GECToR and 14 for Grammarly) and fewer mis-corrections than GECToR (3 versus 4), which suggests its great potential in grammatical error correction. Meanwhile, ChatGPT produces far more over-corrections (30), which may come from its diverse generation ability as a large language model. While this usually leads to a lower F0.5 score, it also allows more flexible language expressions in GEC.
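The three issue types can be made concrete with a rough automatic approximation. The annotation in this report was done manually by two annotators, so the following is only an illustrative heuristic under stated assumptions: edits are extracted with a whitespace-token diff from Python's `difflib`, and a system edit is matched to a reference edit only when both touch exactly the same source span. The function names and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def token_edits(source, target):
    """Map each edited source span (start, end) to its replacement tokens."""
    src, tgt = source.split(), target.split()
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    return {
        (i1, i2): tuple(tgt[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def tally_issues(source, reference, hypothesis):
    """Count under-, mis- and over-corrections by exact source-span matching."""
    ref = token_edits(source, reference)
    hyp = token_edits(source, hypothesis)
    return {
        # Reference edit whose span the system left untouched.
        "under": sum(1 for span in ref if span not in hyp),
        # Same span edited, but with a different replacement.
        "mis": sum(1 for span in ref if span in hyp and hyp[span] != ref[span]),
        # System edit on a span the reference does not change.
        "over": sum(1 for span in hyp if span not in ref),
    }


# Hypothetical sentences, not drawn from the CoNLL2014 data.
source = "He go to school yesterday ."
reference = "He went to school yesterday ."
print(tally_issues(source, reference, source))
print(tally_issues(source, reference, "He goes to school yesterday ."))
print(tally_issues(source, reference, "He went to class yesterday ."))
```

Exact-span matching is deliberately crude: a fluent rewrite that spans several tokens would be counted as over-correction even when it fixes the underlying error, which is one reason human judgment is used for this analysis.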
  Discussions. We have checked the outputs corresponding to the results of Table 5, and observed different behaviors of ChatGPT and Grammarly. The slight improvement (i.e., +0.5 F0.5) by Grammarly mainly comes from punctuation problems. ChatGPT is not sensitive to punctuation problems but Grammarly is, though its modifications are not always correct. For example, when we manually undo the corrections on punctuation, the F0.5 score increases by 0.0015. Other than punctuation problems, Grammarly also corrects a few grammatical errors on articles, prepositions, and plurals. However, these corrections usually require Grammarly to repeat the process twice. Take the following sentence as an example:
  
  ... constructs of the family and kinship are a social construct,
  ...
  
  Grammarly first changes it to
  
  ... constructs of the family and kinship are a social constructs,
  ...
  
  It then changes it to
  
  ... constructs of the family and kinship are social constructs,
  ...
  
  Nonetheless, it does correct some errors that ChatGPT fails to correct.
Conclusion

This paper evaluates ChatGPT on the task of Grammatical Error Correction (GEC). By testing on the CoNLL2014 benchmark dataset, we find that ChatGPT performs worse than the commercial product Grammarly and the state-of-the-art model GECToR in terms of automatic evaluation metrics. By examining the outputs, we find that ChatGPT displays a unique ability to go beyond one-by-one corrections by changing surface expressions and sentence structure while maintaining grammatical correctness. Human evaluation results confirm this finding and reveal that ChatGPT produces fewer under-correction or mis-correction issues but more over-corrections. These results demonstrate the limitation of relying solely on automatic evaluation metrics to assess the performance of GEC models and suggest that ChatGPT has the potential to be a valuable tool for GEC.

Limitations and Future Works

There are several limitations in this version, which we leave for future work:

More Datasets: In this version, we only use the CoNLL-2014 test set and only randomly select 100 sentences to conduct the evaluation. In our future work, we will conduct experiments on more datasets.
More Prompts and In-context Learning: In this version, we only use one prompt to query ChatGPT and do not utilize advanced techniques from the in-context learning field, such as providing demonstration examples (Brown et al., 2020) or chain-of-thought prompting (Wei et al., 2022b), which may underestimate the full potential of ChatGPT. In our future work, we will explore in-context learning methods for GEC to improve its performance.
More Evaluation Metrics: In this version, we only adopt Precision, Recall and F0.5 as evaluation metrics. In our future work, we will utilize more metrics, such as pretraining-based metrics (Gong et al., 2022), to evaluate the performance comprehensively.

References

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. ArXiv.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS.

Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction. In BEA@ACL.

Christopher Bryant, Zheng Yuan, Muhammad Reza Qorib, Hannan Cao, Hwee Tou Ng, and Ted Briscoe. 2022. Grammatical error correction: A survey of the state of the art. ArXiv.

Paul Francis Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. NeurIPS.

Simon Frieder, Luca Pinchetti, Ryan-Rhys Griffiths, Tommaso Salvatori, Thomas Lukasiewicz, Philipp Christian Petersen, Alexis Chevalier, and J J Berner. 2023. Mathematical capabilities of ChatGPT. ArXiv.

Peiyuan Gong, Xuebo Liu, Heyan Huang, and Min Zhang. 2022. Revisiting grammatical error correction evaluation and beyond. EMNLP.

Grammarly. 2023. Grammarly website about us page.

Wenxiang Jiao, Wenxuan Wang, Jen tse Huang, Xing Wang, and Zhaopeng Tu. 2023. Is ChatGPT a good translator? A preliminary study. ArXiv.

Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In CoNLL.

Reham Omar, Omij Mangukiya, Panos Kalnis, and Essam Mansour. 2023. ChatGPT versus traditional question answering for knowledge graphs: Current status and future directions towards knowledge graph chatbots. ArXiv.

Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem N. Chernodub, and Oleksandr Skurzhanskyi. 2020. GECToR – grammatical error correction: Tag, not rewrite. In Workshop on Innovative Use of NLP for Building Educational Applications.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. ArXiv.

Sebastian Ruder. 2022. NLP-progress.

Joel R. Tetreault, Keisuke Sakaguchi, and Courtney Napoles. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In EACL.

Wenxuan Wang, Wenxiang Jiao, Yongchang Hao, Xing Wang, Shuming Shi, Zhaopeng Tu, and Michael Lyu. 2022. Understanding and improving sequence-to-sequence pretraining for neural machine translation. In ACL.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022a. Finetuned language models are zero-shot learners. ICLR.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Huai hsin Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. NeurIPS.

Wikipedia contributors. 2023a. F-score – Wikipedia, the free encyclopedia. [Online; accessed 5-March-2023].

Wikipedia contributors. 2023b. Grammarly – Wikipedia, the free encyclopedia. [Online; accessed 2-March-2023].

Chun Xia and Lingming Zhang. 2023. Conversational automated program repair. ArXiv.

Xianjun Yang, Yan Li, Xinlu Zhang, Haifeng Chen, and Wei Cheng. 2023. Exploring the limits of ChatGPT for query or aspect-based text summarization. ArXiv.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Can ChatGPT understand too? A comparative study on ChatGPT and fine-tuned BERT. ArXiv.

